A Comparative Study of Generative Models for Document Clustering

Authors

  • Shi Zhong
  • Joydeep Ghosh
Abstract

Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This paper compares these three probabilistic models for text clustering, both theoretically and empirically, using a general model-based clustering framework. For each model, we investigate three strategies for assigning documents to models: maximum likelihood (k-means) assignment, stochastic assignment, and soft assignment. Our experimental results over a large number of datasets show that, in terms of clustering quality, (a) the Bernoulli model is the worst for text clustering; (b) the vMF model produces better clustering results than both the Bernoulli and multinomial models; and (c) soft assignment leads to comparable or slightly better results than hard assignment. We also use deterministic annealing (DA) to improve the vMF-based soft clustering, and we compare all the model-based algorithms with the state-of-the-art discriminative approach to document clustering based on graph partitioning (CLUTO) and with a spectral co-clustering method. Overall, CLUTO and DA perform the best but are also the most computationally expensive; the spectral co-clustering algorithm fares worse than the vMF-based methods.
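The three assignment strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes unit-length document vectors, equal cluster priors, and a single shared concentration parameter kappa, under which the vMF log-likelihood reduces to kappa times the cosine similarity plus a constant. All function names and the toy vectors are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm, as directional (vMF) models require."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def posteriors(x, means, kappa=10.0):
    """P(cluster | x), assuming equal priors and a shared kappa (assumptions).

    Under these assumptions the normalizing constants cancel, so only
    kappa * cosine(x, mean) matters for each cluster.
    """
    scores = [kappa * dot(x, m) for m in means]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def hard_assign(post):
    """Maximum likelihood (k-means style): pick the most probable cluster."""
    return max(range(len(post)), key=post.__getitem__)


def stochastic_assign(post, rng):
    """Stochastic assignment: sample one cluster from the posterior."""
    return rng.choices(range(len(post)), weights=post, k=1)[0]


def soft_assign(post):
    """Soft assignment: keep the full posterior as fractional memberships."""
    return list(post)


# Toy example with two cluster mean directions (hypothetical data).
means = [normalize([1.0, 0.0, 0.0]), normalize([0.0, 1.0, 0.0])]
x = normalize([3.0, 1.0, 0.0])
post = posteriors(x, means)
print(hard_assign(post), stochastic_assign(post, random.Random(0)), soft_assign(post))
```

In this simplified setting, increasing kappa pushes the posterior toward a one-hot vector, so soft assignment approaches hard assignment by cosine similarity, which is the assignment rule of spherical k-means.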

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Exploring Differential Topic Models for Comparative Summarization of Scientific Papers

This paper investigates differential topic models (dTM) for summarizing the differences among document groups. Starting from a simple probabilistic generative model, we propose dTM-SAGE that explicitly models the deviations on group-specific word distributions to indicate how words are used differentially across different document groups from a background word distribution. It is more effective...

A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

A Comparative Study between a Pseudo-Forward Equation (PFE) and Intelligence Methods for the Characterization of the North Sea Reservoir

This paper presents a comparative study between three versions of adaptive neuro-fuzzy inference system (ANFIS) algorithms and a pseudo-forward equation (PFE) to characterize the North Sea reservoir (F3 block) based on seismic data. According to the statistical studies, four attributes (energy, envelope, spectral decomposition and similarity) are known to be useful as fundamental attributes in ...

A hierarchical model for clustering

We propose a new hierarchical generative model for textual data, where words may be generated by topic-specific distributions at any level in the hierarchy. This model is naturally well-suited to clustering documents in preset or automatically generated hierarchies, as well as categorising new documents in an existing hierarchy. Training algorithms are derived for both cases, and illustrated on ...


Publication date: 2003